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Suppose that k series, all having the same autocorrelation function, are observed in parallel at n 
points in time or space. From a single series of moderate length, the autocorrelation parameter 
/3 can be estimated with limited accuracy, so we aim to increase the information by formulating 
a suitable model for the joint distribution of all series. Three Gaussian models of increasing 
complexity are considered, two of which assume that the series are independent. This paper 
studies the rate at which the information for jS accumulates as k increases, possibly even be- 
yond n. The profile log likelihood for the model with fc(fc + l)/2 covariance parameters behaves 
anomalously in two respects. On the one hand, it is a log likelihood, so the derivatives satisfy 
the Bartlett identities. On the other hand, the Fisher information for /3 increases to a maximum 
at fc = n/2, decreasing to zero for k>n. In any parametric statistical model, one expects the 
Fisher information to increase with additional data; decreasing Fisher information is an anomaly 
demanding an explanation. 

Keywords: ancillary statistic; Bartlett identity; combination of information; decreasing Fisher 
information; group orbit; marginal likelihood; profile likelihood; random orthogonal matrix 

1. Introduction 

Let xi, . . . ,Xnhe points in space or time. At each point x,;, the fc-variate response Y(xi) = 
(Yii, . . . ,Yik) is measured. The values are recorded in matrix form Y = {Yir} with one 
column for each of the k series and one row for each of the n points. Each series is 
a stationary autoregressive process with autocorrelation parameter /3, and wc aim to 
estimate this parameter as accurately as possible by pooling information from all k 
series. 

Three Gaussian models arc considered, all having moments of the form 

E(K,,)=0, COv{Y,r,Yjs)=T,,^rs (1) 

with autocorrelation function F. The zero- mean assumption is inconsequential and is 
made for simplicity of notation. It can be replaced by a standard multivariate regression 
model (Section 3). The three model variants differ only in the assumptions made about 
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the matrix S, which governs the variances and covariances of the k series. These are as 
follows: 

Model I: i; = cr2/fc, Model II: S = diag{crj, cr^.}, Model III: S e ^fe, 

where PD^ is the space of fc x fc symmetric positive definite matrices. For each model, 
we study the profile log likelihood for /3, show that it satisfies the Bartlett identities and 
study the rate of change of the Fisher information with k and n. 

Model III aims to accommodate correlations among the series in a simple and natural 
way, but for fc > 2?! — 1, the number of parameters exceeds the number of observations. 
This simple counting argument suggests that we might encounter Neyman-Scott phe- 
nomena such as bias, inconsistency or inefficiency in the estimation of /3 (Neyman and 
Scott (1948)). The failure of profile likelihoods to satisfy the Bartlett identities is the chief 
explanation for Neyman-Scott phenomena, and the asymptotic bias can often be elimi- 
nated by a simple adjustment (Bartlett (1953, 1955), Patterson and Thompson (1971), 
Cox and Reid (1987), McCullagh and Tibshirani (1990)). The fact that the profile hke- 
lihood for (3 in models I-III satisfies the Bartlett identities suggests that Neyman-Scott 
phenomena should not arise. This intuition is correct for models I and II. However, the 
marginal likelihood for [3 in model III illustrates a new anomaly for fc > n/2, namely, 
that the Fisher information can be increased by deleting one or more series. 

Although it is sometimes natural, the separability assumption in (1) is very strong, 
even for version III. Stein (1999, 2005) is rightly critical of the use of separable covariances 
for either purely spatial or spatio-temporal processes. However, the product form of the 
covariance function is extremely convenient and widely used, and there do exist applica- 
tions in which this assumption is reasonable. It occasionally happens in agricultural field 
trials that two observations are made on each plot, for example, yield of grain and yield 
of straw. Although the two yields are certainly correlated, there is good reason to expect 
that both processes have very similar spatial autocorrelation functions (McCullagh and 
Clifford (2006)). Mitchell et al. (2006) give further references to applications and develop 
a likelihood-ratio test for separability based on independent replicates of the matrix Y. 
The motivating example for this work arises in a non-spatial context, the estimation of 
a phylogenetic tree for n species from aligned sequences at multiple homologous loci. 
Under the model of neutral evolution, the phylogenetic relationship among species is the 
same at each locus, which implies (1). For further details, see Section 4. 

2. Profile likelihood 

The log likelihood for all three models is 

/(r,S;r) = -iiogdet(r® s) - ^tiiY'r-^YT.-^) 

= -^iog|r| - |iog|s| - itr(r'r-iys]-i), 
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using the formula for the determinant of a Kronecker product (Harville (1997), page 
350). For fixed F, the log likelihood for model III is maximized at Sr = YT^^Y/n. The 
log likelihood for model II is maximized at diag(Er) and the log likelihood for model I 
at tr(Er)/fc/fc. 



The profile log likelihood for F is 








f h 

-|iog|r|- 


^iogtr(yT-iy) 


(Model I), 


lp{T;Y)^< 


-^logiri- 


-iog|diag(r'F-iy)| 


(Model II), 




^-^logiri- 


-iog|y'F-ir| 


(Model III) 



The assumption fc < n is necessary in model III to ensure that the matrix Y'Y^^Y is 
positive definite with probability one. 

The profile log likelihood for model II is a sum over the fc series, the contribution of 
series r being 

1 r) 

--iog|F|--iog(y;F-iy,)- 

This is, in fact, the marginal log likelihood based on the standardized statistic y^/llyrll, 
where Yr is the rth column of Y (Bellhouse (1978), Timnichffe- Wilson (1989), Cruddas, 
Reid and Cox (1989)). 

For a one-parameter model with derivative matrix D = dT/d(3, the derivative of the 
profile log likelihood is 

-kti{WD)+nktT(Y'AY)/tT:{Y'WY) (Model I), 

-ktiiWD) + nJ2{YrAyr)/{Y^WYr) (Model II), 

-ktiiWD) + 7itr{{Y'WY)-'^Y'AY) (Model III), 

where W = F^-^ and A ~ WDW. The quadratic form tT{Y'WY) in model I is distributed 
as cr^Xnk^ independently of the ratio tr{Y' AY)/ tr{Y'WY) (Boos and Hughes-Oliver 
(1998)). The expected value of the ratio is the ratio of expected values, which is 

^/ tT{Y'AY) \ _ fctr(AF) _ ti{WD) 
\tT{Y'WY) J " ^ ~ n ' 

It follows that the log likelihood derivative for model I has zero expectation. The same 
argument applied to each scries leads to the same conclusion for model II. 

The argument for model III is superficially more complicated. For fixed F, the natural 
quadratic form Y'WY is a complete sufficient statistic for E, with expectation nS. The 
statistic iT:{{Y'WY)~^Y' AY) is invariant under the group GL{TV') of linear transforma- 
tions Y i~> Yg acting by right composition. Hence, the distribution does not depend on S. 
By Basu's theorem (Basu (1955)), every ancillary statistic such as tr{{Y'WY)~^Y' AY) 
is independent of Y'WY. Consequently, if we transform to Z = W^I'^Y and condition on 
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the event Y'WY = Z'Z = Ik, the columns of Z are orthonormal, the first k columns of 
a random orthogonal matrix, uniformly distributed with respect to Haar measure on the 
orthogonal group (Heibcrgcr (1978), Stewart (1980), Diaconis and Shahshahani (1994)). 
Hence, 

E[tT{{Y'WYy^Y'AY)] = tT[E{{Y'WY)~^Y'AY)] 

= tr[E(z'ri/2Ari/2z)] 

= kti{rA)/n 
= ktr{WD)/n, 

since ZZ' is a random projection with rank k <n and expectation kin/n. For all three 
models, the first derivative has zero expectation, so the elimination of S by maximization 
has not introduced a bias. 

Similar, but more intricate, calculations for random orthogonal matrices described in 
Appendix A reveal that 



k 

2(7^ 2) 
_fc(n-fc^ (III) 
2(n-l)(n + 2) ^ ^' 



where V = ntr{WDWD) — tP{WD). For model III, this formula holds only for k <n. 
Thus, the second Bartlett identity is satisfied, and it follows from Appendix B that the 
Bartlett identities of all orders are satisfied. 

For small fc, the Fisher information increases roughly in proportion to the number of 
series, all series contributing equally. If, in fact, the series are independent and identically 
distributed, the efficiency of model II to model I is {nk + 2)/{nk + 2k), which is fairly 
high, even for a large number of short series. For example, if n = 10, the relative efficiency 
decreases from 1.0 to 10/12 as fc ^ oo. For fixed k, the relative efficiency increases with 
n, presumably because the number of nuisance parameters in model II is fixed. It appears 
from these calculations that the additional flexibility of model II over model I comes at 
a fairly small cost, so II is likely to be preferred over I in most circumstances. 

The most striking anomaly for large k is that the Fisher information for /3 in model III 
is monotone decreasing for k > n/2 and is reduced to zero for k >n. For a conventional 
one-parameter model with distributions fk {yi , . . . ,yk', P), the Fisher information satisfies 

, / d\ogfk{yk\yi,---,yk-i;l3) \ 
Fife = FIfc_i + var( — 1 > FI^.i, 



so the Fisher information is necessarily non-decreasing in fc. It is immaterial whether the 
components are scalars or vectors. This factorization argument also holds for marginal 
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distributions based on residuals, that is, the REML likeUhood for variance components 
or spatial autocorrelations. It also covers the marginal likelihood for models I and II, 
and conditional likelihoods of the type used to eliminate nuisance parameters in binary 
regression models. However, explicit Fisher information calculations for P in model III 
show that this seemingly impregnable argument may fail. The difficulty lies in the fact 
that the marginal distributions fk of the maximal invariant in model III cannot be 
factored: fk-i is not the marginal distribution of fk under deletion of the last component 
(sec Section 5). 

Bearing in mind the stated goal of increasing precision by pooling information from 
all series, the third formulation is a complete success for small k. But it is a spectacular 
failure for large k because any information about (3 that is present in the first few series 
remains available even when further scries are observed. The marginal likelihood with 
fc > n is constant and thus devoid of information, but the marginal likelihood based on 
any single series or pair of series is informative and the Fisher information is positive. 
A skeptical reader may consider the case k = n, where the matrix Y is invertible with 
probability one. Direct examination of (2) for model III shows that the term dei{Y'WY) 
factors and that the log likelihood does not depend on the parameter. These conclusions 
are independent of the nature of the model for F. 

If k < n, the log likelihood function for model III may be used for inference about 
P, either for computing a point estimate and standard error, for generating confidence 
intervals or for computing posterior intervals. However, if fc > n/2, greater precision can 
be achieved by discarding a random subset of the series and applying the same model to 
the remainder. This counterintuitive behavior is easily verified by simulation. 

3. Regression effects 

The standard model with zero-mean Gaussian variables is easily extended to include 
linear models having non-zero mean. The simplest model of this form is the standard 
Gaussian multivariate regression model. 



where the model matrix X is of order n x p with rank p <n and is a parameter matrix 
of order px k. This model with fc = 2 occurs in field trials where the response is bivariate, 
for example, weight of grain and weight of straw on each plot (McCuUagh and Clifford 
(2006)). The log likelihood based on residuals or X-contrasts (Patterson and Thompson 
(1971), Harville (1977)) is 



where Q = /„ — X{X'WX) ^X'W has rank n — p and Det(-) is the product of the 
non-zero eigenvalues. The profile log likelihood for F in model HI is 



E{Y)^xe, 



cov(y) = F E 



(3) 



[(F, S; r) = ^ logDet(H^Q) - ^ log |S]| - ^ tr{Y'WQY^-'), 



lp{T;Y)^-logDct{WQ) 



max(fc, n — p) 



logDet(r'VFQy). 



2 
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All of the remarks made in the preceding section about the Fisher information hold for 
the profile residual likelihood with n replaced by n — p and W by WQ. 



4. Application to phylogenetics 

The motivating example for this work comes from genetics, where sequence data are 
observed for n species at k homologous loci. In Kim and Pritchard (2007), n = 5 and the 
loci are highly conserved non-coding sequences numbering several thousand. In reality, 
the value at each locus is a sequence from the genetic alphabet, but we assume here for 
simplicity that this can be coded in such is a real number. For locus r, the 

covariance of Yir with Yj> is arr^ij, where cr^r is the site-specific mutation rate and Fjj is 
the length of the ancestral tree that is shared by the two species. Under neutral evolution, 
the genetic distance between species is constant, the same at each locus. Furthermore, 
the responses at different loci may be correlated due to their proximity on the genomes of 
one or more species. The natural Gaussian model is (3) with X = 1, the constant vector. 

Our aim is to estimate the ancestral tree using one of the three variants of (3). The 
profile log likelihood function for model I is 

I{T;Y) = ^logDet(M^Q) - Il^—H^ logtr(y'I^Qy) 

= ^logDet(iyg) - l!i_il^logtr(M^g5) 

= ^ logDet(I^Q) + iH^lh logtr{WQD), 

where S — YY' is the observed inner product matrix and Dij = Su + Sjj — 2Sij is the 
observed squared distance between species. This expression is the log likelihood function 
on phylogenetic trees based on the marginal distribution of the squared distance matrix 
D (McCullagh (2008)). 

If we wish to take account of locus-specific mutation rates, version II of the standard 
model is more appropriate. The profile log likelihood function for this model is 

i{T;Y) = -iogDet(iyg) - ^ ^ iog(y;iygy,) 

r=l 

k , „ .,,,„s n ■ 



-logDet(iyQ) + -Y^logtT{WQDr), 



r=l 



which requires locus-specific squared distance matrices Dr{i,j) — {Yir — Yj 



\2 



jr I 



Although formulation III appeared to be appropriate and natural for this application, 
the model with general S is a total failure because k is much larger than n and the profile 
log likelihood is uninformative. 
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Tractable models intermediate between II and III can be used to take account of 
correlations and to pool information more efficiently. The technique is illustrated here 
by the set of Markov matrices, that is, E is a Green's matrix of the form aibj for i < j 
and E"-'^ is a symmetric Jacobi, or tri-diagonal, matrix (Karlin (1968), Section 3.3). Let 
Yq = and let Qr be the orthogonal projection in 7?." with kernel span(X, F^-i) and 
rank n — p — 1 for r > 1. Conditional on Yi,. . ., Yr-i, the residual log likelihood for F 
based on QrYr /\\QrYr\\ is 

ilogDet(M/Q.) , rank(Qr) iog(y;^Q^y^). 

The full log likelihood is the sum of k similar terms, and the derivatives have the 
same form as those for model II. Since Qr is a random projection, the matrix Vr = 
ntT{{WQrD)^) — t'i:^{WQrD) governing the conditional Fisher information is also ran- 
dom. No closed-form expression is available for the expected value, but symmetry con- 
siderations indicate that the total Fisher information is of order ^rank((5r), directly 
proportional to the number of series. The marginal likelihood for the series in reverse 
order is different, but the Fisher information is the same. Neither marginal likelihood 
coincides exactly with the profile likelihood. 

5. Marginal likelihood and group orbits 

In order to eliminate 9 from the likelihood in the model Y ^ N{X9, F), part of the data is 
ignored. The residual likelihood function is based on the statistic LY ^ N{0, LTL'), where 
L is any linear transformation with kernel X = span(X). To eliminate scalar constants, 
we also ignore scalar multiples and base the likelihood function on the reduced statistic 
or For fc = 1, this technique gives a marginal log likelihood of 

1{T; Y)^- logDet(iyQ) - \og{Y'WQY), 

where W = F^^ and Q^I ~ X{X'WX)-^X'W has rank n - p. Note that [(aF;r) = 
L{r;Y), so the marginal likelihood is constant on scalar multiples of F. Equivalent ver- 
sions of this marginal likelihood function have been given by Bellhouse (1978, 1990), 
Cruddas, Reid and Cox (1989) and Tunnicliffc- Wilson (1989). 

The marginal log likelihood is based on the maximal invariant under the action of a 
certain group on the observation space Y G TZ". The standard residual likelihood asso- 
ciated with the group of translations Y i-^ Y + x with x ^ X leads to the REML log 
likelihood 

ilogDet(VFQ) - ^Y'WQY. 

The maximal invariant can be described in one of two ways, either in terms of X-contrasts 
or in terms of the group orbit which is the coset y + X. When the group is extended 
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to include scalar multiplication, the maximal invariant is reduced and the marginal log 
likelihood is the function 1{V]Y) shown above. 

In the multivariate case Y ^ N{X9, F (g) S), the regression parameter is eliminated, as 
above, by considering an arbitrary linear transformation L: TZ^ — > TZ" with kernel X and 
applying it to each of the columns of Y. The kernel is thus X®'', the group orbits are 
cosets and the multivariate residual log likelihood is 

I logDet(TyQ) - ^ log |S]| - i tr{Y'WQYT,-^). 

If we now extend the group by linear transformations Y ^ Yg with g S GL{TZ''), the 
dependence on S vanishes and the marginal log likelihood is 

^ logDet(VKQ) - '^^^(^'^^ logDet(y'W^Qy) 

(Appendix B). Since this is a log likelihood function, the Bartlett identities are automat- 
ically satisfied, as was observed in Section 2. 

The preceding remarks help to explain the anomalous behavior of the log likelihood 
under model III. For X ~ and fc = 1, each one-dimensional subspacc excluding the origin 
is a group orbit in TZ", so there are as many orbits as there are points on the projective 
sphere in TZ". For general k, the observation space is T^"'^, but a typical group orbit 
has dimension , so the maximal invariant has dimension k{n — k), which is the factor 
governing the rate of increase of the Fisher information. For k > n, there is one group 
orbit that has probability one, so the invariant statistic is degenerate and uninformative. 

The preceding discussion suggests the following question. The action of the group 
GL{TZ^) is such that that the maximal invariant has a distribution independent of S. Can 
the same effect be achieved at less cost by a sub-group? The answer, which is a qualified 
"yes", is now illustrated by the sub-group UTk of upper triangular transformations. 
Taking the series in the order given, the maximal invariant is constructed as follows. For 
each series 1^, compute the residual after linear regression on both X and Yi, . . . ,Yr-i-, 
ignoring scalar multiples. The contribution to the log likelihood function from the series 
Yr is 

ilogDet(tyO.) -l^^^^^\og{Y:WQrYr), 

where Qr is the orthogonal projection in 7^" with inner product matrix W = T^^ and null 
space span(X, Yi, . . . , Yr-i). The contribution to the Fisher information is non-negative, 
but zero for r >n — p. The total log likelihood based on the maximal invariant under the 
upper triangular sub-group is thus 

^' 1 I 1 

J2^iogDetiWQr) log(y>Q,.y,). 

r=l 

The group determines the order in which the series are taken, each order has a different 
maximal invariant and the log likelihood clearly depends on the order. No closed-form 
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expressions are available for the Fisher information, but, by contrast with the behavior 
for GL{TZ^), the Fisher information does not decrease with k. 

Appendix A: Haar moments 

Let be a random orthogonal matrix uniformly distributed with respect to Haar measure 
on the orthogonal group of order n. The value in row r and column j is denoted by HI , so 
the (r, j) component of H"^ is H^Hf using the summation convention for repeated indices. 
By contrast, the (r, s) component of HH' is H^.H^ = Srs, where 5rs is the Kronecker 
symbol for the identity matrix. 

Since —H has the same distribution as H , the moments and cumulants of odd order 
are zero. For n >2, the non-zero moments and cumulants up to order four are 



Subscripts on the left-hand sides are in one-to-one alphabetic correspondence with su- 
perscripts. For a moment or cumulant of order fc, the right-hand side is a sum over 
bi-partitions of {l,...,fc}, that is, ordered pairs of partitions of subscripts and super- 
scripts, all partitions having blocks of size two only. For example, the diagonal bi-partition 
(13|24, 13|24) appears in alphabetic form as Sri^su^''^^''^ i while (12|34, 13|24) appears as 
5rs6tu5^^5->^ ■ The coefficient depends only on the least upper bound of the two partitions. 
Since there are three partitions of four elements into two blocks of size two, there are 
nine bi-partitions of {1, . . . ,4}, the three diagonal elements having one coefficient in the 
fourth moment and the six off-diagonal elements having a different coefficient. Likewise, 
there are 15 partitions of six elements into blocks of size two, so the sixth moment is a 
sum over 15^ ~ 225 bi-partitions. The 15 diagonal pairs have a least upper boimd with 
three blocks, a further 90 pairs have a least upper bound with two blocks and the remain- 
ing 120 pairs have a least upper bound with one block. Thus, there arc three distinct 
coefficients in the sum over bi-partitions of {1, . . . ,6}. 
It follows that 



E(tr(i/2)) ^ ^{HIHD - 5,r&'-/n - 1, 

E(tr2(iJ)) = E(X;x;) = 5rs5''''/n = 1, 

E(tr(i/4)) ^ J^{H:^H:HtHi) 

= 5rs5tu{{n + 1)J™J^*[3] - 5''5'^m/{7i{n - l)(n + 2)) 
= {{n + l){n^ + 2n) - 2n{n + 2))/{n{n -l){n + 2)) = 1, 



[6] 



cum4(i^;,i^i,i^^i^^) 



n(n-l)(n + 2) 

2SrsStuS'^S'''[3] - n6rJtuS'''S^'[6] 
rfi{n~ l)(ri4-2) 
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= 5rsStu{{n + 1)(5'^-M*"[3] ~ <5'^*<5^"[6])/(n(n - l)(n + 2)) 
= {3n'^{n+l)-6n)/{n{n-l){n + 2))=3, 



in agreement with more general formulae for moments of traces given by Diaconis and 
Shahshahani (1994). Finally, for the variance or covariance of log likelihood derivatives 
under model III, let Z consist of the first k columns of H, so that indices r, s,... run 
from 1 to < 71. Then 



Etr{Z'AZ) = EiZiZ^Aij) = S"^"^ S'^ / n = ktr{A)/n, 
E(tr(Z'AZ) tr{Z'BZ)) = E^Z^Z^^A.^Z^ ZIBu) 
= A,jBkiE{ZlZlZ':zi) 

_ k{nk + k-2) tr{A) tr{B) + 2k{n - k) tv{AB) 
~ n(n- l)(n + 2) 



Appendix B: Distribution of the maximal invariant 

Let y be a random matrix of order n x k with density f{y) dy with respect to Lebcsgue 
measure at y £ T?,"'^. In order to calculate the distribution of the maximal invariant 
under the action of GL{TZ^) by right multiplication, we first observe that the action 
on the first k components is weakly transitive. For n = k, there are many group orbits, 
but for continuous distributions, there is a single orbit that has probability one. Under 
standard conditions, the matrix g = F^*^' consisting of the first k rows of Y has full rank, 
so the group clement sends y to a standard configuration or representative orbit 
element Z = Yg^^ in which the leading k rows are equal to Ik- 

The Jacobian of the transformation Y {g,Z) is equal to l^l""'^, so the marginal 
density of Z is 



Simplification of this expression for the Gaussian distribution with covariance (1) gives 
the marginal likelihood function in the form 



and 



cov(tr(Z'AZ),tr(Z'SZ)) 



2k{n - k){tr{AB) - tr{A)tT{B)/n) 
n{n- l)(n + 2) 





In other words, the profile log likelihood (2) coincides with the marginal log likelihood 
based on the maximal invariant. 
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